121 research outputs found
Learning Models over Relational Data using Sparse Tensors and Functional Dependencies
Integrated solutions for analytics over relational databases are of great
practical importance as they avoid the costly repeated loop data scientists
have to deal with on a daily basis: select features from data residing in
relational databases using feature extraction queries involving joins,
projections, and aggregations; export the training dataset defined by such
queries; convert this dataset into the format of an external learning tool; and
train the desired model using this tool. These integrated solutions are also a
fertile ground of theoretically fundamental and challenging problems at the
intersection of relational and statistical data models.
This article introduces a unified framework for training and evaluating a
class of statistical learning models over relational databases. This class
includes ridge linear regression, polynomial regression, factorization
machines, and principal component analysis. We show that, by synergizing key
tools from database theory such as schema information, query structure,
functional dependencies, recent advances in query evaluation algorithms, and
from linear algebra such as tensor and matrix operations, one can formulate
relational analytics problems and design efficient (query and data)
structure-aware algorithms to solve them.
This theoretical development informed the design and implementation of the
AC/DC system for structure-aware learning. We benchmark the performance of
AC/DC against R, MADlib, libFM, and TensorFlow. For typical retail forecasting
and advertisement planning applications, AC/DC can learn polynomial regression
models and factorization machines with at least the same accuracy as its
competitors and up to three orders of magnitude faster than its competitors
whenever they do not run out of memory, exceed 24-hour timeout, or encounter
internal design limitations.
Comment: 61 pages, 9 figures, 2 tables
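The ridge linear regression mentioned in the abstract illustrates why aggregate-pushdown helps: its closed-form solution depends on the data only through two aggregates, the Gram matrix X^T X and the vector X^T y, which a structure-aware system can compute over the joins without materializing the full training matrix. A minimal NumPy sketch of this aggregate-based formulation (not the AC/DC implementation; the data here is hypothetical, standing in for the result of a feature extraction query):

```python
import numpy as np

# Hypothetical training data, standing in for the materialized
# result of a feature extraction query over a relational database.
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 3.5]])
y = np.array([3.0, 2.0, 4.0, 7.0])
lam = 0.1  # ridge penalty

# Ridge regression needs only two aggregates over the data:
# the Gram matrix X^T X and the vector X^T y.
gram = X.T @ X   # (features x features) aggregate
xty = X.T @ y    # (features,) aggregate

# Closed-form ridge solution: (X^T X + lam * I) theta = X^T y.
theta = np.linalg.solve(gram + lam * np.eye(X.shape[1]), xty)
```

The point of the sketch is that once `gram` and `xty` are available, the model is solved without ever touching the rows again, so the expensive part can stay inside the database engine.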
Efficiently decodable non-adaptive group testing
We consider the following "efficiently decodable" non-adaptive
group testing problem. There is an unknown string
x ∈ {0,1}^n with at most d ones in it. We are allowed to test
any subset S ⊆ [n] of the indices. The answer to the test
tells whether x_i = 0 for all i ∈ S or not. The objective
is to design as few tests as possible (say, t tests) such that
x can be identified as fast as possible (say, in poly(t) time).
Efficiently decodable non-adaptive group testing has applications
in many areas, including data stream algorithms and
data forensics.
A non-adaptive group testing strategy can be represented
by a t × n matrix, which is the stacking of all the
characteristic vectors of the tests. It is well-known that if
this matrix is d-disjunct, then any test outcome corresponds
uniquely to an unknown input string. Furthermore, we know
how to construct d-disjunct matrices with t = O(d² log n)
efficiently. However, these matrices so far only allow for a
"decoding" time of O(nt), which can be exponentially larger
than poly(t) for relatively small values of d.
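The O(nt) decoding time the abstract refers to comes from the standard naive decoder for a disjunct matrix: report item i as positive exactly when every test containing i came back positive, which requires scanning all n columns against all t outcomes. A small self-contained sketch (the test matrix here is hand-built for illustration, not the paper's randomness-efficient construction):

```python
# Each row of M is the characteristic vector of one test over n = 5 items.
# Hand-built for illustration; not the paper's construction.
M = [
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [1, 0, 0, 0, 1],
]
x = [0, 1, 0, 0, 0]  # unknown string with at most d ones

# A test is positive iff its set contains some i with x_i = 1.
outcomes = [any(M[t][i] and x[i] for i in range(len(x)))
            for t in range(len(M))]

# Naive decoder: item i is positive iff no negative test contains i.
# Scans all n items against all t outcomes -> O(nt) time.
decoded = [i for i in range(len(x))
           if all(outcomes[t] for t in range(len(M)) if M[t][i])]
```

When M is d-disjunct, `decoded` is exactly the support of x; for a weaker matrix it may be a superset, which is the relaxation the list disjunct matrices below formalize.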
This paper presents a randomness-efficient construction
of d-disjunct matrices with t = O(d² log n) that can be decoded
in time poly(d) · t log² t + O(t²). To the best of our
knowledge, this is the first result that achieves an efficient decoding
time and matches the best known O(d² log n) bound
on the number of tests. We also derandomize the construction,
which results in a polynomial-time deterministic construction
of such matrices when d = O(log n / log log n).
A crucial building block in our construction is the
notion of (d,l)-list disjunct matrices, which represent the
more general "list group testing" problem whose goal is to
output less than d + l positions in x, including all the (at
most d) positions that have a one in them. List disjunct
matrices turn out to be interesting objects in their own right
and were also considered independently by [Cheraghchi,
FCT 2009]. We present connections between list disjunct
matrices, expanders, dispersers and disjunct matrices. List
disjunct matrices have applications in constructing (d,l)-
sparsity separator structures [Ganguly, ISAAC 2008] and in
constructing tolerant testers for Reed-Solomon codes in the
data stream model.
Funding: David & Lucile Packard Foundation; Center for Massive Data Algorithmics (MADALGO); National Science Foundation (U.S.) Grant CCF-0728645; National Science Foundation (U.S.) Grant CCF-0347565; National Science Foundation (U.S.) CAREER Award CCF-0844796.